KnowledgeMiner has a long history. It builds on more than 30 years of research activities and results from different sciences such as cybernetics, systems theory, computer science, and mathematics. Different approaches to Inductive Learning from scientists in Ukraine, Germany, the USA, Japan, China, and elsewhere flow together to provide a powerful, unique, and easy-to-use piece of data mining software. These technologies were first used in the military area; our goal is to integrate them into software tools that can help solve real-world problems in economy, ecology, medicine, or sociology. Our international team of researchers and developers is constantly working to improve our products, and we keep your comments and questions in mind. Please write to our development or research staff. We will respond as quickly as possible. You may also want to check out our data mining service.
KnowledgeMiner Discussion Forum
You can contribute to our discussion forum by posting comments, results, or answers. Here you may also find answers to your questions.
Please check out the Easy Learning link to get a quick leg up in using KnowledgeMiner. This page was created by our user Bert Altenburg. Thanks, Bert.
Here are some Frequently Asked Questions from our
customers.
Q: What does GMDH stand for?
A: GMDH stands for Group Method of Data Handling. It is a statistical learning network technology using the cybernetic approach of self-organization, drawing on systems, information and control theory as well as computer science. GMDH is not a traditional statistical modeling method; it is an interdisciplinary approach designed to overcome some main disadvantages of statistics and NNs. Below is a description of GMDH from the preface to Farlow's book.
"In statistics
nowadays there is a distinguishable trend away from the
restrictive assumptions of parametric analysis and toward
the more computer-oriented area of what is generally
known as nonparametric data analysis. One of the more
fascinating concepts from this new generation of research
is what is known as the GMDH algorithm, which was
introduced and is currently being developed by the
Ukrainian cyberneticist and engineer
A. G. Ivakhnenko.
What is known these days as a heuristic, the GMDH
algorithm constructs high-order regression-type models
for complex systems and has the advantage over
traditional modeling in that the modeler can more-or-less
throw into the algorithm all sorts of input/ output types
of observations, and the computer does the rest. The
computer self-organizes the model from a simple one to
one of optimal complexity by a methodology not unlike the
process of natural evolution. It is the purpose of this
book to introduce to English-speaking people the basic
GMDH algorithm, present variations and examples of its
use and list a bibliography of all published work in this
growing area of research."
From: S. J. Farlow (ed.), Self-Organizing Methods in Modeling: GMDH Type Algorithms (1984)
Here is what Prof. A. G. Ivakhnenko says:
"The Group Method
of Data Handling (GMDH) is self-organizing approach based
on sorting-out of gradually complicated models and
evaluation of them by external criterion on separate part
of data sample. As input variables can be used any
parameters, which can influence on the
process.
Linear or non-linear,
probabilistic models or clusterizations are selected by
minimal value of an external criterion. The sorting
algorithms are rather simple and they get information
directly from data sample. The effective input variables,
number of layers and neurons in hidden layers, optimal
model structure are determined automatically. This is
based on that fact that external criterion characteristic
have minimum during complication of model structure. GMDH
inductive approach is different from commonly used
deductive techniques and networks.
The GMDH was developed
for complex systems modelling, forecasting and data
mining, analysis of multivariate processes, decision
support after "what-if" scenario, diagnostics, pattern
recognition and clusterization of data sample. Since 1968
many books, more than 230 doctoral dissertations were
devoted to investigations in very different fields. It
was proved, that for inaccurate, noisy or small data can
be found best optimal simplified model, accuracy of which
is higher and structure is simpler than structure of
usual full physical model. For real problems with noised
or short data samples, simplified forecast models becomes
more effective.
Recent developments of the GMDH have led to neuronets with active neurons, which realize a twice-multilayered structure: the neurons themselves are multilayered, and they are connected into a multilayered structure. This makes it possible to optimize the set of input variables at each layer while the accuracy increases. The accuracy of forecasting, approximation, or pattern recognition can be increased beyond the limits reached by neuronets or statistical methods.
Not only GMDH algorithms, but many modeling or pattern recognition algorithms can be used as active neurons. Their accuracy can be increased in two ways:
- each output of an algorithm (active neuron) generates a new variable which can be used as a new factor in the next layers of the neuronet;
- the set of input factors can be optimized at each layer. In a usual once-multilayered NN the set of input variables can be chosen only once. The output variables of previous layers in such networks are very effective secondary inputs for the neurons of the next layers.
Neuronets with active neurons and basic GMDH algorithms were described in:
Selforganization of Neuronets with Active Neurons. Ivakhnenko, A.G., Ivakhnenko, G.A., Mueller, J.A.; Pattern Recognition and Image Analysis, 1994, vol. 4, no. 2;
Self-Organization of Nets of Active Neurons. Ivakhnenko, A.G., Mueller, J.A.; SAMS, 1995, vol. 20, pp. 93-106.
The GMDH theory was also published in:
Inductive Learning Algorithms for Complex System Modeling. Madala, H.R. and Ivakhnenko, A.G., 1994, CRC Press;
and
Self-Organizing Methods in Modelling (Statistics: Textbooks and Monographs, vol. 54), Farlow, S.J. (ed.), 1984, Marcel Dekker Inc."
You can find a short intro in Paper 1 or comprehensive reading in the new book by Mueller/Lemke, "Self-Organising Data Mining". You may also want to look at the publications area for more information.
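For readers who prefer to see the idea in code, here is a minimal, illustrative sketch of the layer-wise GMDH principle described above. It is not KnowledgeMiner's implementation; the function names and parameters are made up for this example. It fits simple quadratic partial models for pairs of variables, ranks them by their error on a separate validation part of the data (the external criterion), feeds the best outputs into the next layer as new inputs, and stops as soon as the external criterion stops improving.

# Minimal GMDH-style sketch (illustrative only, not KnowledgeMiner's code).
import itertools
import numpy as np

def fit_partial(x1, x2, y):
    """Least-squares fit of y ~ a0 + a1*x1 + a2*x2 + a3*x1*x2 + a4*x1^2 + a5*x2^2."""
    A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_partial(coef, x1, x2):
    A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
    return A @ coef

def gmdh(X_train, y_train, X_val, y_val, width=6, max_layers=5):
    best_err = np.inf
    for _ in range(max_layers):
        candidates = []
        for i, j in itertools.combinations(range(X_train.shape[1]), 2):
            coef = fit_partial(X_train[:, i], X_train[:, j], y_train)
            err = np.mean((predict_partial(coef, X_val[:, i], X_val[:, j]) - y_val) ** 2)
            candidates.append((err, i, j, coef))
        candidates.sort(key=lambda c: c[0])
        layer_err = candidates[0][0]
        if layer_err >= best_err:       # external criterion stopped improving:
            break                       # the previous layer was optimally complex
        best_err = layer_err
        keep = candidates[:width]       # survivors become inputs of the next layer
        X_train = np.column_stack([predict_partial(c, X_train[:, i], X_train[:, j])
                                   for _, i, j, c in keep])
        X_val = np.column_stack([predict_partial(c, X_val[:, i], X_val[:, j])
                                 for _, i, j, c in keep])
    return best_err

# Toy usage: 8 candidate inputs, only two of them actually drive the output.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] * X[:, 0] + 0.1 * rng.normal(size=60)
print("validation MSE:", gmdh(X[:40], y[:40], X[40:], y[40:]))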
Q: Would you consider your products suitable for
financial and product-demand forecasting using numerous
variable inputs?
A: Yes, this is one of the primary application fields for KnowledgeMiner. In contrast to statistics or NNs, you can use more variables than there are samples available for modeling. For example, you can create a prediction model (e.g., a linear system of equations) of 40 variables even though only 30 observations are available for each variable. You can consider up to 500 input variables (lagged and unlagged) in KnowledgeMiner to model complex time processes. Additionally, KnowledgeMiner implements Analog Complexing, an extremely powerful prediction technique for fuzzy processes such as financial markets. KnowledgeMiner, when used on financial markets, could really strike gold!
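To give an idea of what Analog Complexing does, here is a rough sketch in Python of the general pattern-matching principle (an illustration only, not KnowledgeMiner's actual algorithm): it searches the history of a time series for the windows most similar to the most recent one and averages their continuations to form a forecast.

# Sketch of the Analog Complexing idea: forecast by analogy with the most
# similar historical patterns. Illustrative assumption of how such a
# pattern-based forecast works, not KnowledgeMiner's implementation.
import numpy as np

def analog_forecast(series, window=10, horizon=5, n_analogs=3):
    series = np.asarray(series, dtype=float)
    reference = series[-window:]
    scores = []
    # candidate windows must leave room for a continuation of length `horizon`
    for start in range(len(series) - window - horizon):
        candidate = series[start:start + window]
        # compare shapes after removing mean and scale, so similar patterns
        # on different levels still match
        a = (candidate - candidate.mean()) / (candidate.std() + 1e-12)
        b = (reference - reference.mean()) / (reference.std() + 1e-12)
        scores.append((np.mean((a - b) ** 2), start))
    scores.sort()
    continuations = [series[s + window:s + window + horizon] for _, s in scores[:n_analogs]]
    return np.mean(continuations, axis=0)

# Toy usage on a noisy seasonal series
t = np.arange(300)
data = np.sin(2 * np.pi * t / 25) + 0.1 * np.random.default_rng(1).normal(size=300)
print(analog_forecast(data, window=25, horizon=5))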
Q: It "feels" like KnowledgeMiner might assist in
detecting relationships among certain patient groups by
clinical criteria vs. fluid measurements that may be missed
by an individual. If I understand the application of
KnowledgeMiner, I believe I should be able to take our
database of eye features along with the diopter measurements
for each patient in the database, plug those into
KnowledgeMiner and then KnowledgeMiner will derive an
equation for calculating the diopter measurement of a
patient as a function of the patient's image features. Is
this true?
A: Yes, exactly. This is something KnowledgeMiner
can do.
Q: First, I'd like to compliment you on your choice of
platform ;-). With Motorola's new math libraries and the
higher clock speeds of their chips, math intensive
applications such as KnowledgeMiner (KM) are best run on a
Mac; Byte's recent benchmarks show the new Macs running
twice as fast in SpecInt and 50% faster in SpecFPU than
comparable P5 or P6 chips running at the same speed.
I'm a defense contractor in the U.S. and I also work as a
consultant doing image processing programming and object
classification work. I downloaded the KM Demo last night and
was very impressed with what I think I saw! Congratulations
on a very impressive algorithm and its implementation into a
GUI that everyone can use. One of the difficulties with
Statistical Pattern Recognition in my application is that
one might not get a sufficiently sophisticated classifier to
give the best possible results (for example using a linear
classifier instead of a more complex quadratic classifier).
It appears that KM does not suffer from this problem because it produces a bona fide nonlinear equation which should optimally accommodate any irregular shaping of the class populations in feature space. Is this true?
A: Yes, you are correct. One important feature of
KnowledgeMiner is that it creates models in an evolutionary
way: From very simple models to increasingly more complex
ones. It stops automatically when an optimally complex model
is found. That is, when it begins to overfit the design data
(the data used to create relationships between
variables).
Q: The possibility of a time lag model is really interesting too. In human training studies, the number of measures per year is very low (2-6) compared to the number of test variables (10-20). How many subjects are necessary?
A: The same is true if you want to create a dynamic model. In contrast to statistics or Neural Networks, KnowledgeMiner can deal with a very small number of cases (6+). In fact, the number of cases used for modeling can be smaller than the number of variables (so-called under-determined tasks). So it is really possible for you to use 10 variables and only 6-10 samples to create a linear system of equations.
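A small sketch of how such an under-determined task can still be solved (illustrative only; the data and numbers are made up): instead of fitting one full model to all 10 variables at once, all small linear sub-models are tried and the one with the smallest error on a few held-out cases is selected.

# Under-determined task: more candidate variables (10) than cases (8).
# Try all small linear sub-models and pick the one with the smallest error
# on a couple of held-out cases (the external criterion). Illustration only.
import itertools
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 10))            # 8 cases, 10 candidate variables
y = 2.0 * X[:, 3] - 1.0 * X[:, 7] + 0.05 * rng.normal(size=8)

X_train, y_train = X[:6], y[:6]         # 6 cases for fitting
X_val, y_val = X[6:], y[6:]             # 2 cases reserved for validation

best = None
for k in (1, 2):                        # sub-models with at most 2 variables
    for cols in itertools.combinations(range(10), k):
        A = np.column_stack([np.ones(6)] + [X_train[:, c] for c in cols])
        coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
        Av = np.column_stack([np.ones(2)] + [X_val[:, c] for c in cols])
        err = np.mean((Av @ coef - y_val) ** 2)
        if best is None or err < best[0]:
            best = (err, cols, coef)

print("selected variables:", best[1], "validation MSE:", best[0])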
Q: What would be the largest table (columns and rows)
KnowledgeMiner could accommodate if allocated 100 MB of
RAM?
A: The table below contains approximate values as an orientation:
Rows  | Max. inputs
100   | 500
200   | 350
300   | 280
400   | 240
500   | 210
1000  | 150
2000  | 110
5000  | 70
KnowledgeMiner optimizes several modeling tasks, so it is
not possible to give exact values in advance. The real
memory requirements may actually be smaller.
Q: We've purchased NGO and our company is Windows-based. I'll be doing KM at home on my Performa 6400/180. If I get the full version, what kind of performance can I expect on the Performa?
A: Two aspects: speed and RAM space. 180 MHz is good even for large problems. For small modeling problems (< 50 inputs and < 100 samples), it will take a few minutes and, say, 100 KB-2 MB of RAM temporarily to create a GMDH model (once you are familiar with it). However, RAM requirements in particular grow rapidly (to 10-100 MB and more) for larger modeling problems (> 100 inputs and > 500 samples). It can then take up to an hour or two to get a model, compared to the days or weeks that alternative methods would need for problems of this complexity.
Q: How many records of data can I put into KM if I have 11 inputs and 6 outputs? Will I need a different data sheet for each output even if the input values are the same? Is copying and pasting the easiest way, or should I save subsequent output entries with different names?
A: No, KM is "Un-PC" too! It can handle up to 500
inputs (including lagged variables for dynamic modeling) and
a virtually unlimited number of outputs (read: models) in a
single document using the same physical data sheet without
copying/pasting any data. All models are stored in a model
base and for each column of the sheet, 4 different model
types can be created and stored simultaneously: a time
series model (auto-regressive), an input-output model
(static or dynamic), a system model
(multi-input/multi-output) and an Analog Complexing
model.
KM 3.0 implements a third modeling method: self-organizing fuzzy-rule induction, or Fuzzy-GMDH. So a fifth model can be added to the model base for each column. Also, KM 3 will extend the spreadsheet from the current 1,000 rows up to 10,000 rows.
Q: I've recently downloaded KM and I am wondering how it compares to NGO for Windows. I'm going to compare models and "closeness" of fit between the two, but I'm concerned the demo version will cut me off at 4 levels and not fit as well as it might have if I had the full potential of the full edition. What are the advantages of KM over NGO or even more expensive software packages such as GenSym?
A: This has been described a little elsewhere in this FAQ. An important advantage is also that KM always produces a model description usable for interpretation and analysis. You can see why results are as they are and which variables KM has selected as relevant. For fuzzy-rule induction, for example, you will get models in an almost natural language, as this model from the wine recognition example shows:
IF N_Flavanoids & NOT_N_Nonflavanoid phenols & NOT_N_Color intensity
OR NOT_N_Ash & N_OD280/OD315 of diluted wines
OR NOT_N_Color intensity & NOT_P_Magnesium & N_Flavanoids & NOT_N_Alcalinity of ash & NOT_P_Hue
THEN wine_cultivar #3
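To illustrate how such a rule can be read, here is a small sketch that evaluates the rule above with standard fuzzy logic (min for AND, max for OR, 1 - x for NOT), assuming every term is a membership value between 0 and 1. The exact fuzzification and rule evaluation used by KnowledgeMiner may differ; this only shows the structure.

# Evaluate the wine rule above with generic fuzzy logic. The variable names
# come from the example; the membership values below are made-up toy numbers.
def NOT(x):
    return 1.0 - x

def rule_wine_cultivar_3(m):
    clause1 = min(m["N_Flavanoids"], NOT(m["N_Nonflavanoid phenols"]), NOT(m["N_Color intensity"]))
    clause2 = min(NOT(m["N_Ash"]), m["N_OD280/OD315 of diluted wines"])
    clause3 = min(NOT(m["N_Color intensity"]), NOT(m["P_Magnesium"]), m["N_Flavanoids"],
                  NOT(m["N_Alcalinity of ash"]), NOT(m["P_Hue"]))
    return max(clause1, clause2, clause3)   # degree to which the sample is cultivar #3

# Toy membership values for one wine sample
memberships = {"N_Flavanoids": 0.9, "N_Nonflavanoid phenols": 0.2, "N_Color intensity": 0.3,
               "N_Ash": 0.4, "N_OD280/OD315 of diluted wines": 0.8, "P_Magnesium": 0.1,
               "N_Alcalinity of ash": 0.3, "P_Hue": 0.2}
print(rule_wine_cultivar_3(memberships))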
The main difference, however, is that KM, in addition to the black-box approach and the connectionism of NNs, is based on a third principle called inductive self-organization.
Inductive self-organizing modeling theory and practice have shown that "closeness of fit" cannot be the only criterion for finding a "best" model. It is necessary, during modeling, to validate each model candidate's performance on some new data. If this step is missing (as is common in neuro-fuzzy-genetic approaches), models inherently tend to be overfitted. This is because it is always possible (at least theoretically) to formulate a model that fits any given (finite) learning data set with almost 100% accuracy, driven by the rule "the more complicated the model, the more accurately it will fit the given data." This is also true for completely random samples. For noisy data, this means that, at a certain point in modeling, the model begins to fit the noise (overfitting), which results in bad or catastrophic performance on new data. The model fits the design data better, but at the same time it loses accuracy when applied to previously unseen data. It is too complex. So the problem is to find the point where a model begins to reflect random relations. This is what we call creating an optimally complex model. GMDH can do this.
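The effect is easy to reproduce with a few lines of Python (a generic illustration, not KnowledgeMiner's criterion): fit models of increasing complexity, here polynomials of increasing degree, to noisy design data and watch the error on data not used for fitting reach a minimum and then rise again.

# Optimal complexity illustration: design error keeps falling as complexity
# grows, while the error on fresh data has a minimum (the optimally complex
# model) and then rises again (overfitting).
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + 0.2 * rng.normal(size=40)

x_fit, y_fit = x[::2], y[::2]           # design data
x_new, y_new = x[1::2], y[1::2]         # data not used for fitting

for degree in range(1, 10):
    coef = np.polyfit(x_fit, y_fit, degree)
    fit_err = np.mean((np.polyval(coef, x_fit) - y_fit) ** 2)
    new_err = np.mean((np.polyval(coef, x_new) - y_new) ** 2)
    print(f"degree {degree}: design MSE {fit_err:.4f}, unseen MSE {new_err:.4f}")
# The degree with the smallest "unseen MSE" is the optimally complex model.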